1 Introduction

This R Markdown script contains all the code used for outlier detection, data analysis, and plotting. It includes descriptions of the materials and methods, additional statistical analyses, Monte Carlo simulations, and all the statistical models with their summaries.

2 Experiment 1

2.1 Stimuli

Exp1 (identification task) tests NMS’ implicit knowledge of Māori words of varying frequency. The stimulus materials for the identification task consist of 1,000 Māori words and 1,000 Māori-like nonwords. To obtain the word list, all words in two Māori running speech (RS) corpora (32, 33) are divided into five frequency bins and 200 words are randomly selected from each bin. A list of 10,000 pseudowords (1,000 for each phoneme length from 3 to 12) was generated from a trigram language model by a pseudoword generator (34), using a Māori dictionary (35) and the two RS corpora (32, 33) as training sets. For each word, a nonword is chosen by matching its length and phonotactic score. The phonotactic scores of the stimuli are computed using type frequency obtained from the Māori dictionary (35). A trigram model is selected in order to take positional information within words into account, with an additional symbol indicating word boundaries while building the language model. The phonotactic score of each stimulus is obtained by summing log-transformed transitional probabilities, which better reflect frequency distributions than raw probabilities (36), and then normalizing by length. Witten-Bell smoothing in SRILM (SRI Language Modeling Toolkit) (37) is used to handle unknown trigrams in nonwords: because nonwords are generated from training data composed of the two RS corpora (32, 33) and the Māori dictionary (35), some of them contain trigrams that do not occur in a trigram language model built from the dictionary alone. Every participant received a different list of 300 stimuli containing 30 pairs (word and nonword) randomly selected from each of the five bins (30*2*5 = 300 in total). The list of stimuli used in Exp1 can be found in “StimuliExp1.txt”.
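The scoring procedure just described (summing log-transformed trigram transitional probabilities, with a boundary symbol, and normalizing by length) can be sketched in a few lines of R. The trigram log-probabilities below are toy values, not the actual model trained on the dictionary (35), and the character-level split is a simplification that ignores phoneme segmentation (e.g. the digraphs wh and ng) and smoothing of unseen trigrams:

```r
# Toy trigram log-probabilities keyed as "a-b-c"; "#" marks a word boundary,
# as in the trigram model described above. These values are made up.
logProbs <- c("#-#-k" = log(0.20), "#-k-a" = log(0.50),
              "k-a-i" = log(0.30), "a-i-#" = log(0.40))

# Length-normalized phonotactic score: sum of trigram log-probabilities
# divided by the number of phonemes in the word. The real model uses
# Witten-Bell smoothing for trigrams unseen in training; this sketch
# assumes all trigrams are known.
phonotacticScore <- function(word, logProbs) {
  phones <- c("#", "#", strsplit(word, "")[[1]], "#")
  trigrams <- sapply(seq_len(length(phones) - 2),
                     function(i) paste(phones[i:(i + 2)], collapse = "-"))
  sum(logProbs[trigrams]) / nchar(word)
}

phonotacticScore("kai", logProbs)
```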

2.2 Participants

In Exp1, nonfluent Māori speakers based in New Zealand (NMS), 18 years or older, were recruited online through Facebook Ads. Several exclusion criteria were applied to filter out unusable participants. All participants speak New Zealand English as their first language. However, nine participants learned English outside New Zealand, and among them we discard one participant who has lived in New Zealand for less than ten years. Furthermore, based on their answers to the post-questionnaire, we remove two participants whose self-reported level of Māori proficiency corresponds to at least “fairly well”, two participants with some knowledge of another Polynesian language, and five participants who reported a history of speech or language impairments. Due to a technical error during the experiment, the data of one participant who rated too small a number of stimuli are discarded. After removing these outliers, we also look at the pattern of variability in participants’ responses and discard one participant whose standard deviation of ratings falls more than two standard deviations below the mean across participants.

# Loading the data to filter out participants
dataExp1 <- read.delim("./dataAnonNotFilteredExp1.txt", sep ="\t", header = TRUE)
dataExp1$word = as.character(dataExp1$word)
Encoding(dataExp1$word) = "UTF-8"

# Remove one participant who did 192 items instead of 303
nbResponse <- aggregate(dataExp1$enteredResponse, by=list(dataExp1$workerId),  FUN=length)
rmParticipant1Exp1 <- nbResponse[!nbResponse$x %in% c(303),]$Group.1 # 4c62b407
dataExp1 <- dataExp1[!dataExp1$workerId %in% rmParticipant1Exp1,]

# Remove three participants who gave ratings lower than 4 to the three most common Māori words borrowed into New Zealand English
list_carrot <- c("haka","kai","aotearoa")
carrots <- dataExp1[dataExp1$word %in% list_carrot,]
carrots <- carrots[as.numeric(carrots$enteredResponse) < 4,]
rmParticipant2Exp1 <- unique(carrots$workerId) # 51aee430 7d4efcd7 15950569
dataExp1 <- dataExp1[!dataExp1$workerId %in% rmParticipant2Exp1,]

# Remove ratings for the three words (used to detect outliers)
dataExp1 <- dataExp1[!dataExp1$word %in% c("haka","kai","aotearoa"),]

# Remove two participants whose speakMaori or compMaori is at least (equal to or above) 3
rmParticipant3Exp1 <- unique(dataExp1[dataExp1$speakMaori >= 3 | dataExp1$compMaori >= 3,]$workerId) # b59cb76f 56496f48
dataExp1 <- dataExp1[!dataExp1$workerId %in% rmParticipant3Exp1,]

# Remove one participant who did not learn their English in NZ and have been living in their current location in NZ for less than ten years (duration == "short")
summaryExp1WorkerId <- unique(dataExp1[,c("workerId","firstLangCountry","place","duration")])
EngNotInNZExp1 <- summaryExp1WorkerId[!summaryExp1WorkerId$firstLangCountry=="NZ",]
rmParticipant4Exp1 <- unique(EngNotInNZExp1[EngNotInNZExp1$duration=="short",]$workerId) # 2fc57ffc
dataExp1 <- dataExp1[!dataExp1$workerId %in% rmParticipant4Exp1,]

# Remove two participants who know any other Polynesian languages
rmParticipant5Exp1 <- unique(dataExp1[dataExp1$anyPolynesian=="Yes",]$workerId) # aa4d5676 2b08d4eb
dataExp1 <- dataExp1[!dataExp1$workerId %in% rmParticipant5Exp1,]

# Remove five participants with language impairments
rmParticipant6Exp1 <- unique(dataExp1[dataExp1$impairments=="Yes",]$workerId) 
# ae48fa46 b589ecc3 367285b5 6ff6dcb6 bcae3239
dataExp1 <- dataExp1[!dataExp1$workerId %in% rmParticipant6Exp1,]

# Remove one participant whose pattern of responses (SD) is below 2SD of the mean of all participants
SD <- aggregate(dataExp1$enteredResponse, by=list(dataExp1$workerId), sd)
cut <- mean(SD$x)-2*sd(SD$x)
rmParticipant7Exp1 <- SD[!SD$x > cut,]$Group.1 # d7a9b857
dataExp1 <- dataExp1[!dataExp1$workerId %in% rmParticipant7Exp1,]

# Check the total number of usable participants for Exp1
# length(unique(dataExp1$workerId)) # 85

2.3 Overview of participants’ sociolinguistic profile

Participants’ age and highest level of education are distributed quite evenly, in contrast to their self-reported level of Māori proficiency, current place of living, and gender. There are substantially more female participants than male participants, and more participants living in the North Island than in the South Island. Regarding their sociolinguistic profile in relation to Māori, after filtering out participants who rated their Māori proficiency at least as “fairly well”, most participants self-reported some basic knowledge of Māori, and only a few responded that their level of exposure to Māori was less than once a year. The figure below summarizes the distribution of participants along these demographic and linguistic axes.

Fig. S1: Overview of participants’ sociolinguistic profile in Exp1.

2.4 Dataset structure

The data is structured as follows:

  • workerId is the unique ID for each participant.
  • enteredResponse is the wellformedness rating for each stimulus.
  • reactionTime is the reaction time for each rating (in seconds).
  • type is the classification of each stimulus: word (‘real’) or nonword (‘pseudo’).
  • length is the phoneme length of each stimulus.
  • word is the stimulus used for the rating.
  • speakMaori is each participant’s answer to the question how well they can speak Māori (with a scale ranging from 0 to 5).
  • compMaori is each participant’s answer to the question how well they can understand/read Māori (with a scale ranging from 0 to 5).
  • maoriProf is the sum of the quantified responses for speakMaori and compMaori, representing each participant’s level of Māori proficiency.
  • age is the age group to which each participant belongs.
  • gender is the gender of each participant.
  • ethnicity is categorized into binary answers, either Māori (M) or non-Māori (non M).
  • education is each participant’s highest level of education.
  • children is each participant’s answer to the question whether they have had any children who have attended preschool or primary school in New Zealand in the past five years.
  • maoriList is each participant’s basic knowledge of Māori (with a scale ranging from 0 to 9).
  • place is each participant’s current place of living (categorized into binary classification, either North or South Island in New Zealand).
  • duration is how long each participant has lived in their current place (a binary classification, long: > 10 years and short: <= 10 years).
  • firstLang is each participant’s first language.
  • firstLangCountry is the country where each participant learned their first language.
  • anyOtherLangs is the information regarding any other languages each participant can speak.
  • hawaii is the binary response to the question whether participants have lived in Hawaii.
  • anyPolynesian is the binary response to the question whether participants know any Polynesian language such as Hawaiian, Tahitian, Sāmoan, or Tongan.
  • whichPolynesian is the information regarding participants’ knowledge of any Polynesian languages if they knew any.
  • impairments is the answer to the question whether participants have a history of any speech or language impairments.
  • maoriExpo is each participant’s level of exposure to Māori (with a scale ranging from 0 to 10).
  • Freq is the frequency of real word stimulus obtained from Māori running speech corpora (Freq=0 for pseudowords).
  • bin is the classification of each stimulus according to its frequency of occurrence in Māori running speech corpora.
  • score is the phonotactic score obtained across word types in the dictionary.

2.5 NMS confidence ratings across word frequency with ordinal mixed-effects modeling

2.5.1 Table S1, model summary

Table S1: Ordinal mixed-effects model of NMS confidence ratings across word frequency. All numeric variables in this model except for the wellformedness rating are centered.
Parameter Estimate Std. Error z.value Pr(>|z|)
Effects c.(length) -0.010 0.019 -0.531 0.595
c.(score) 2.183 0.232 9.404 0.000 ***
typereal 0.659 0.090 7.323 0.000 ***
bin1v2+ 0.000 0.087 0.000 1.000
bin2v3+ 0.098 0.086 1.143 0.253
bin3v4+ 0.057 0.092 0.620 0.535
bin4v5 0.130 0.106 1.229 0.219
c.(length):c.(score) 0.129 0.050 2.611 0.009 **
typereal:bin1v2+ 0.428 0.132 3.251 0.001 **
typereal:bin2v3+ 0.098 0.126 0.776 0.438
typereal:bin3v4+ 0.062 0.129 0.481 0.630
typereal:bin4v5 -0.053 0.151 -0.352 0.725
Thresholds 1|2 -2.626 0.119
2|3 -0.823 0.118
3|4 1.089 0.118
4|5 3.040 0.119

2.5.2 Fig. S2, effect plots

Fig. S2: Interaction effect plots: Fig. S2a (left) shows the interaction between frequency bin and lexicality (real words vs. nonwords); Fig. S2b (right) shows the interaction between phonotactic score and length of stimuli. The bars represent 95% confidence intervals.

2.6 Code for plotting Fig. 1

Fig. 1: Mean well-formedness ratings for real words and nonwords by frequency bin. Bin1 contains the most frequent words and Bin5 consists of the least frequent words. Horizontal lines show mean ratings for real words vs. nonwords per bin.

2.7 Mean rating and phonotactic score for each stimulus per frequency bin

Fig. S3: Mean rating vs. phonotactic score for each stimulus for real words and nonwords by frequency bin.

2.8 Rematching stimuli pairs

During the analysis of Exp1, we found an overall bias towards higher phonotactic scores for Māori words and lower phonotactic scores for Māori-like nonwords (see Fig. S3). This pattern is caused by a lack of nonwords phonotactically matching the words randomly selected from the Māori RS corpora. To solve this problem, nonwords and words are rematched across participants by finding, for each nonword, the phonotactically matched word with the lowest absolute score difference. Among the rematched pairs, only those whose absolute difference is below 0.1 are kept. In the initial dataset, there are 200 pairs in each bin. After rematching, the number of pairs in each bin is reduced as follows: 60 (Bin1), 38 (Bin2), 39 (Bin3), 46 (Bin4), 43 (Bin5).
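The rematching procedure can be sketched as follows, on toy data frames with hypothetical column names. The sketch ignores the bin structure and the across-participant bookkeeping, and allows a word to be reused for several nonwords:

```r
# Toy stimuli with phonotactic scores (contents and column names are made up)
words    <- data.frame(word = c("w1", "w2", "w3"), score = c(-1.20, -1.45, -1.90))
nonwords <- data.frame(word = c("n1", "n2", "n3"), score = c(-1.22, -1.60, -1.48))

# For each nonword, pick the word minimizing the absolute score difference,
# then keep only pairs whose difference is below the 0.1 threshold
rematch <- function(nonwords, words, maxDiff = 0.1) {
  idx  <- sapply(nonwords$score, function(s) which.min(abs(words$score - s)))
  diff <- abs(words$score[idx] - nonwords$score)
  data.frame(nonword = nonwords$word, word = words$word[idx],
             diff = diff)[diff < maxDiff, ]
}

rematch(nonwords, words)
```

Here n2 is dropped because its closest word differs by more than 0.1 in score, mirroring how the number of pairs per bin shrinks after rematching.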

2.8.1 Rematched pairs: mean rating and phonotactic score for each stimulus per frequency bin

Fig. S4: Mean rating and phonotactic score for each stimulus per frequency bin (rematched pairs).

2.8.2 NMS confidence ratings with ordinal mixed-effects model (rematched pairs)

Table S2: Ordinal mixed-effects model of NMS confidence ratings (for rematched pairs).
Parameter Estimate Std. Error z.value Pr(>|z|)
Effects c.(score) 2.626 0.659 3.984 0.000 ***
typereal 0.441 0.105 4.204 0.000 ***
bin1v2+ 0.381 0.168 2.277 0.023 *
bin2v3+ 0.159 0.194 0.820 0.412
bin3v4+ 0.277 0.203 1.362 0.173
bin4v5 -0.010 0.224 -0.045 0.964
typereal:bin1v2+ 0.137 0.232 0.590 0.555
typereal:bin2v3+ -0.057 0.271 -0.209 0.834
typereal:bin3v4+ -0.019 0.284 -0.067 0.947
typereal:bin4v5 0.100 0.322 0.312 0.755
Thresholds 1|2 -2.536 0.128
2|3 -0.800 0.122
3|4 1.032 0.122
4|5 2.806 0.128

Within each frequency bin containing the rematched pairs, the distinction between words and nonwords is statistically significant in separate ordinal mixed-effects models fit per bin: NMS give higher ratings to words than to nonwords across all the bins, except for Bin1. When the ratings from all the bins are modeled in a single ordinal mixed-effects model, phonotactic score and the word/nonword distinction are significant predictors. This result demonstrates that NMS’ ability to distinguish words from nonwords is not an artefact of their phonotactic knowledge.
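As a simplified illustration of the per-bin analysis, the sketch below fits a separate model within each frequency bin and extracts the lexicality coefficient. A plain lm() on simulated toy ratings stands in for the ordinal mixed-effects models actually used, and the data frame here is made up:

```r
# Toy data: two bins, alternating nonwords ("pseudo") and words ("real"),
# with ratings simulated so that real words score about 0.8 points higher
set.seed(1)
toy <- data.frame(
  bin    = rep(paste0("bin", 1:2), each = 40),
  type   = rep(c("pseudo", "real"), times = 40),
  rating = pmin(5, pmax(1, round(3 + 0.8 * rep(c(0, 1), times = 40) + rnorm(80))))
)

# Fit one model per bin and extract the word/nonword contrast
# (estimate, standard error, t value, p value)
perBin <- lapply(split(toy, toy$bin), function(d) {
  coef(summary(lm(rating ~ type, data = d)))["typereal", ]
})
perBin
```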

3 Experiment 2

3.1 Stimuli

The stimulus materials for Exp2 (well-formedness task) comprise lists of 240 or 320 Māori-like nonwords generated from a trigram model using a pseudoword generator (34). About 6,000 pseudowords (1,000 for each phoneme length from 3 to 8) were generated using a Māori dictionary (35) and two running speech (RS) corpora (32, 33). Nonwords are chosen by decorrelating their phonotactic scores. The phonotactic scores of nonwords are computed using a trigram language model obtained from the Māori dictionary (35), token frequency from the two segmented RS corpora (32, 33), and unsegmented RS data. The phonotactic score of each stimulus is obtained by summing log-transformed transitional probabilities, which better reflect frequency distributions than raw probabilities (36), and then normalizing by length. After decorrelating the three types of phonotactic scores, 240 stimuli are chosen for the lists of shorter phoneme lengths (3 and 4) and 320 stimuli for the lists of longer phoneme lengths (5, 6, 7, and 8). In total, there are six lists, each containing 240 or 320 nonwords of the same phoneme length, presented in a different random order. The experiment is designed this way so as not to introduce any additional influence from stimulus length. The list of stimuli used in Exp2 can be found in “StimuliExp2.txt”.

3.2 Participants

In Exp2, there are three groups of participants: fluent Māori speakers (MS), non-Māori-speaking speakers of New Zealand English (NMS), and non-Māori-speaking speakers of American English (US). MS and NMS were recruited online through Facebook Ads and US through Amazon Mechanical Turk. Participants are adults (18 years or older) who have not previously studied linguistics. Most NMS speak New Zealand English as their first language, but 12 participants who did not learn their English in New Zealand yet have been living there for at least ten years are also counted as NMS. On the other hand, 10 participants who learned English outside New Zealand and have lived in New Zealand for less than ten years are discarded. All US speak American English as their first language, and one participant who learned English outside the United States is discarded.

In addition to participants’ first language, multiple criteria are applied to detect unusable participants. For MS, two participants whose self-reported proficiency level of Māori corresponds to either “no more than a few words or phrases” or “not very well” in the post-questionnaire (below a score of 6) are discarded. For NMS, 14 participants who rated their proficiency level of Māori as “fairly well”, “well”, or “very well” (above a score of 4) are removed. Among them, the data of five participants whose proficiency level corresponds to at least “fairly well” (a score of 6 or above) are reassigned to MS. For US, 14 participants who reported some proficiency in Māori and six participants with some basic knowledge of Māori are discarded. For NMS and US, we also discard eight participants who have lived in Hawaii or have some knowledge of another Polynesian language such as Hawaiian, Tahitian, Tongan, or Samoan, since knowledge of those languages, which belong to the same language family as Māori, may influence participants’ well-formedness judgements. These criteria are not applied to MS. Across all groups, five participants who reported a history of speech or language impairments are removed.

The experimental setup for MS and NMS in New Zealand allows us to detect participants using the same browser; two such participants are detected within each group. However, after examining their data, these participants are kept, as their responses do not look suspicious and the post-questionnaire identifies them as different individuals. Furthermore, due to a technical error during the experiment, the data of two participants who rated too small or too large a number of stimuli are discarded.

After removing unusable participants, there are 40 MS, 113 NMS, and 94 US. At the initial stage of filtering, 140 participants who did not complete the entire task in Experiment 2 are removed. We also filter out participants who rated items with phoneme lengths longer than 8 (i.e., 9 and 10 in MS and 9 in NMS) while the experiment was running. This decision was made because of a very high rate of uncompleted experiments and the difficulty of recruiting participants for longer lengths such as 9 and 10. We also look at the pattern of participants’ response variability within each group and discard one participant in MS, four in NMS, and one in US whose standard deviation of responses falls more than two standard deviations below the group mean, as well as one participant in NMS whose median reaction time per stimulus falls more than two standard deviations below the group mean.

# Loading the data to filter out participants:
dataNotFiltered <- read.delim("./dataAnonNotFilteredExp2.txt", sep ="\t", header = TRUE)
dataNotFiltered$word = as.character(dataNotFiltered$word)
Encoding(dataNotFiltered$word) = "UTF-8"

# Part 1: Removing unusable MS participants
dataMS <- dataNotFiltered[dataNotFiltered$group=="MS",]

# Remove one participant who did 285 items instead of 320 (length 5)
nbResponse <- aggregate(dataMS$enteredResponse, by=list(dataMS$workerId),  FUN=length)
rmParticipant1 <- nbResponse[!nbResponse$x %in% c(320,240),]$Group.1 # b7619945
dataMS <- dataMS[!dataMS$workerId %in% rmParticipant1,]

# Remove two participants whose speakMaori or compMaori is below 3
rmParticipant2 <- unique(dataMS[dataMS$speakMaori <3 | dataMS$compMaori < 3,]$workerId) 
# 1544dbf9 c4d7bd69
dataMS <- dataMS[!dataMS$workerId %in% rmParticipant2,]

# Remove one participant with language impairments
rmParticipant3 <- unique(dataMS[dataMS$impairments=="Yes",]$workerId) # 7fae7c22
dataMS <- dataMS[!dataMS$workerId %in% rmParticipant3,]

# Detect participants whose median reactionTime is shorter than 2*SD below the mean of all MS
median_RT <- aggregate(dataMS$reactionTime, by=list(dataMS$workerId), median)
names(median_RT) <- c("workerId","median");dataMS <- merge(dataMS,median_RT,by="workerId")
cut <- mean(median_RT$median)-2*sd(median_RT$median)
# median_RT[!median_RT$median > cut,]$workerId # None detected!

# Remove one participant whose pattern of responses (SD) is below 2SD of the mean of all MS participants
SD <- aggregate(dataMS$enteredResponse, by=list(dataMS$workerId), sd)
cut <- mean(SD$x)-2*sd(SD$x)
rmParticipant4 <- SD[!SD$x > cut,]$Group.1 # 379bf450
dataMS <- dataMS[!dataMS$workerId %in% rmParticipant4,]

# Check the total number of usable MS participants
# length(unique(dataMS$workerId)) #40

# Part 2: Removing unusable NMS participants
dataNMS <- dataNotFiltered[dataNotFiltered$group=="NMS",]

# Remove one participant who did 436 items instead of 240 (length 4)
nbResponse <- aggregate(dataNMS$enteredResponse, by=list(dataNMS$workerId),  FUN=length)
rmParticipant5 <- nbResponse[!nbResponse$x %in% c(320,240),]$Group.1 # 10c02fd0
dataNMS <- dataNMS[!dataNMS$workerId %in% rmParticipant5,]

# Remove fourteen participants whose speakMaori or compMaori is at least (equal to or above) 3
rmParticipant6 <- unique(dataNMS[dataNMS$speakMaori >= 3 | dataNMS$compMaori >= 3,]$workerId)  # 13076185 fa784dba 3cde3d04 dab57fd5 a9695b3e 945a57c8 8c51cdc1 930984af 61dd030b 3efd81e0 8f889ddc 379bf450 91950415 72d13b4e
dataNMS <- dataNMS[!dataNMS$workerId %in% rmParticipant6,]

# Remove ten participants who did not learn their English in NZ and have been living in their current location in NZ for less than ten years (duration == "short")
summaryNMSWorkerId <- unique(dataNMS[,c("workerId","firstLangCountry","place","duration")])
EngNotInNZ <- summaryNMSWorkerId[!summaryNMSWorkerId$firstLangCountry=="NZ",]
rmParticipant7 <- unique(EngNotInNZ[EngNotInNZ$duration=="short",]$workerId) # 8332d72f 70dafe1e 4f8b6f05 d3083fc5 e53ac07a 17f67c54 8277a12b 6a1740c7 29d42922 9a7885b3
dataNMS <- dataNMS[!dataNMS$workerId %in% rmParticipant7,]

#Remove one participant who has lived in Hawaii
rmParticipant8 <- unique(dataNMS[dataNMS$hawaii=="Yes",]$workerId) # fc35cab1
dataNMS <- dataNMS[!dataNMS$workerId %in% rmParticipant8,]

# Remove four participants who know any other Polynesian languages
rmParticipant9 <- unique(dataNMS[dataNMS$anyPolynesian=="Yes",]$workerId) # f7466ae0 519ab1a7 a8ef4bcb bdabfcc5
dataNMS <- dataNMS[!dataNMS$workerId %in% rmParticipant9,]

# Remove three participants with language impairments
rmParticipant10 <- unique(dataNMS[dataNMS$impairments=="Yes",]$workerId) # 8e8c19fe 55d7e82a fc8d6ce1
dataNMS <- dataNMS[!dataNMS$workerId %in% rmParticipant10,]

#Remove one participant whose median reactionTime is shorter than 2*SD below the mean of all NMS
median_RT <- aggregate(dataNMS$reactionTime, by=list(dataNMS$workerId), median)
names(median_RT) <- c("workerId","median");dataNMS <- merge(dataNMS,median_RT,by="workerId")
cut <- mean(median_RT$median)-2*sd(median_RT$median)
rmParticipant11 <- median_RT[!median_RT$median > cut,]$workerId # 4b1aac09
dataNMS <- dataNMS[!dataNMS$workerId %in% rmParticipant11,]

# Remove four participants whose pattern of responses (SD) is below 2SD of the mean of all NMS participants
SD <- aggregate(dataNMS$enteredResponse, by=list(dataNMS$workerId), sd)
cut <- mean(SD$x)-2*sd(SD$x)
rmParticipant12 <- SD[!SD$x > cut,]$Group.1 # 3bd22196 4f737f51 6889ff72 98590568
dataNMS <- dataNMS[!dataNMS$workerId %in% rmParticipant12,]

# Check the total number of usable NMS participants
# length(unique(dataNMS$workerId)) #113

# Part3: Removing unusable US participants
dataUS <- dataNotFiltered[dataNotFiltered$group=="US",]

# Remove one participant who did 248 items instead of 240 (length 3)
nbResponse <- aggregate(dataUS$enteredResponse, by=list(dataUS$workerId),  FUN=length)
rmParticipant13 <- nbResponse[!nbResponse$x %in% c(320,240),]$Group.1 # 9198e3ed
dataUS <- dataUS[!dataUS$workerId %in% rmParticipant13,]

# Remove fourteen participants whose speakMaori or compMaori is above 0
rmParticipant14 <- unique(dataUS[dataUS$speakMaori > 0 | dataUS$compMaori > 0,]$workerId)  # 5621e137 da8aa1c2 76e49a1d 20a0b59d cd213e66 a0d0f317 349460f3 78e9bed5 8fcb2f09 84975d69 73c79807 ec169941 7c487108 89bf37e6
dataUS <- dataUS[!dataUS$workerId %in% rmParticipant14,]

# Remove six participants whose maoriList is above 0
rmParticipant15 <- unique(dataUS[dataUS$maoriList > 0,]$workerId) # 5d404970 b40035cc 7233c727 30f0793a 01446e83 c3d73f11
dataUS <- dataUS[!dataUS$workerId %in% rmParticipant15,]

# Remove one participant who did not learn English in the US
rmParticipant16 <- unique(dataUS[!dataUS$firstLangCountry=="US",]$workerId) # c7b992cc
dataUS <- dataUS[!dataUS$workerId %in% rmParticipant16,]

# Remove three participants who have lived in Hawaii
rmParticipant17 <- unique(dataUS[dataUS$hawaii=="Yes",]$workerId) # d7464dfa a507847e 6f8cf2ef
dataUS <- dataUS[!dataUS$workerId %in% rmParticipant17,]

#Remove one participant with language impairments
rmParticipant18 <- unique(dataUS[dataUS$impairments=="Yes",]$workerId) # 328bd924
dataUS <- dataUS[!dataUS$workerId %in% rmParticipant18,]

# Detect participants whose median reactionTime is shorter than 2*SD below the mean of all US
median_RT <- aggregate(dataUS$reactionTime, by=list(dataUS$workerId), median)
names(median_RT) <- c("workerId","median");dataUS <- merge(dataUS,median_RT,by="workerId")
cut <- mean(median_RT$median)-2*sd(median_RT$median)
# median_RT[!median_RT$median > cut,]$workerId # None detected!

# Remove one participant whose pattern of responses (SD) is below 2SD of the mean of all US participants
SD <- aggregate(dataUS$enteredResponse, by=list(dataUS$workerId), sd)
cut <- mean(SD$x)-2*sd(SD$x)
rmParticipant19 <- SD[!SD$x > cut,]$Group.1 # 7960174b
dataUS <- dataUS[!dataUS$workerId %in% rmParticipant19,]

# Check the total number of usable US participants
# length(unique(dataUS$workerId)) #94

# Combine the three groups (rbind.fill comes from the plyr package)
library(plyr)
dataExp2 <- rbind.fill(dataMS, dataNMS, dataUS)

3.3 Overview of participants’ profile

Regarding participants’ gender, there are substantially more female than male participants in MS and NMS, while the gender distribution is quite balanced in US. The distribution of participants’ age is also particularly disproportionate in MS and NMS, with more participants in the younger groups. Regarding participants’ highest level of education, there are more high school graduates than undergraduates and graduates in MS, whereas there are more undergraduates in NMS and US. We only asked MS and NMS to provide their geographical information in New Zealand and to self-report their level of exposure to Māori in daily life. There are more participants from the North Island in both MS and NMS, reflecting the larger population of the North Island. The level of exposure to Māori ranges from 2 (“less than once a year”) to 10 (“multiple times a day”); the distribution reveals that most MS are very frequently exposed to Māori in daily life and only a very small number of both MS and NMS are exposed to Māori less than once a year.

Fig. S5: Participants’ basic Māori knowledge and proficiency in Exp2.

Fig. S6: Overview of participants’ profile in Exp2.

Fig. S7: Overview of participants’ profile in Exp2.

3.4 Dataset structure

The data is structured as follows:

  • workerId is the unique ID for each participant.
  • enteredResponse is the wellformedness rating for each stimulus.
  • reactionTime is the reaction time for each rating (in seconds).
  • word is the stimulus used for the rating.
  • speakMaori is each participant’s answer to the question how well they can speak Māori (with a scale ranging from 0 to 5).
  • compMaori is each participant’s answer to the question how well they can understand/read Māori (with a scale ranging from 0 to 5).
  • maoriProf is the sum of the quantified responses for speakMaori and compMaori, representing each participant’s level of Māori proficiency.
  • age is the age group to which each participant belongs.
  • gender is the gender of each participant.
  • ethnicity is categorized into binary answers, either Māori (M) or non-Māori (non M).
  • education is each participant’s highest level of education.
  • children is each participant’s answer to the question whether they have had any children who have attended preschool or primary school in New Zealand in the past five years.
  • maoriList is each participant’s basic knowledge of Māori (with a scale ranging from 0 to 9).
  • place is each participant’s current place of living (categorized into binary classification, either North or South Island in New Zealand).
  • duration is how long each participant has lived in their current place (a binary classification, long: > 10 years and short: <= 10 years).
  • firstLang is each participant’s first language.
  • firstLangCountry is the country where each participant learned their first language.
  • anyOtherLangs is the information regarding any other languages each participant can speak.
  • hawaii is the binary response to the question whether participants have lived in Hawaii.
  • anyPolynesian is the binary response to the question whether participants know any Polynesian language such as Hawaiian, Tahitian, Sāmoan, or Tongan.
  • whichPolynesian is the information regarding participants’ knowledge of any Polynesian languages if they knew any.
  • impairments is the answer to the question whether participants have a history of any speech or language impairments.
  • maoriExpo is each participant’s level of exposure to Māori (with a scale ranging from 0 to 10).
  • group is the classification of participants according to their fluency of Māori and/or exposure to Māori.
  • nz is the binary response to the question whether participants have ever lived in New Zealand.
  • scoreDictType is the phonotactic score obtained across word types in the dictionary.
  • scoreDictToken is the phonotactic score obtained across tokens of words from the dictionary in running speech corpora.
  • scoreRsSegmented is the phonotactic score obtained across tokens of all words in running speech corpora.
  • scoreRsUnsegmented is the phonotactic score obtained across lexically unsegmented stretches of speech in running speech corpora.
  • scoreDictLongAType is the phonotactic score obtained across word types in the dictionary ignoring vowel length distinctions except /a~a:/.
  • scoreDictShortVType is the phonotactic score obtained across word types in the dictionary ignoring all vowel length distinctions.
  • scoreShortVDictNze is the phonotactic score obtained across words derived from the dictionary of Māori words in New Zealand English.
  • scoreNBestFixed is the phonotactic score from the 3,470 most frequent word types with the lowest AIC score, obtained using cutoffs based on raw frequency.
  • scoreMorphShortVType is the phonotactic score obtained across morph types derived from words in the dictionary ignoring vowel length distinctions, unweighted.
  • scoreMorphShortVToken is the phonotactic score obtained across tokens of morphs derived from words in the dictionary ignoring vowel length distinctions, based on running speech corpora, weighted by the number of times they occur in corpora.
  • median is the median of each participant’s reaction time.
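The length-normalized phonotactic score underlying the score* variables above (summed log trigram probabilities, divided by length plus one for the word-boundary symbol) is the same quantity computed as `scoreNor` in the Monte Carlo code below; a minimal base R sketch, using a made-up summed log probability rather than a value from the actual SRILM model:

```r
# Hypothetical summed log trigram probability for a 4-phoneme item;
# in the real pipeline this value comes from the SRILM trigram model.
logprob <- -9.24
item <- "kahu"

# Normalize by length + 1, where the +1 reflects the word-boundary symbol,
# mirroring the scoreNor computation used throughout the Monte Carlo code.
scoreNor <- logprob / (nchar(item) + 1)
scoreNor  # -1.848
```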

3.5 Number of participants per length within each group

Table S3: Number of participants per length within each group
length   MS   NMS   US
3         8    18   16
4         5    17   13
5         6    20   16
6         7    21   16
7         9    18   14
8         5    19   19

3.6 Comparing AIC scores according to resources

Table S4: Comparing AIC scores according to resources
Score                     AIC
scoreDictType        208613.5
scoreDictToken       208806.4
scoreRsSegmented     209007.4
scoreRsUnsegmented   210203.5

3.7 Comparing AIC scores according to vowel length distinction

Table S5: Comparing AIC scores according to vowel length distinction
Score                     AIC
scoreDictType        208613.5
scoreDictLongAType   208458.3
scoreDictShortVType  208110.9

3.8 Well-formedness ratings by participant group (Summary of the best model with the lowest AIC)

Table S6: Ordinal mixed-effects model summary for well-formedness ratings by participant group. All numeric variables in this model are centered.
Parameter                          Estimate  Std. Error  z.value  Pr(>|z|)
Effects
  c.(scoreDictShortVType)             3.118       0.177   17.610     0.000 ***
  macronTRUE                          0.104       0.056    1.858     0.063 .
  groupMS                            -0.098       0.185   -0.526     0.599
  groupUS                             0.611       0.142    4.300     0.000 ***
  c.(scoreDictShortVType):groupMS    -0.089       0.302   -0.297     0.767
  c.(scoreDictShortVType):groupUS    -2.288       0.241   -9.509     0.000 ***
  macronTRUE:groupMS                  0.060       0.073    0.818     0.414
  macronTRUE:groupUS                 -0.147       0.067   -2.206     0.027 *
Thresholds
  1|2                                -1.730       0.098
  2|3                                -0.176       0.098
  3|4                                 0.960       0.098
  4|5                                 2.506       0.098
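The fixed-effects structure in Table S6 corresponds to the following model formula. Note that the `c.()` centering helper and the `clmm()` call with by-participant and by-item random intercepts are assumptions about how the model was fit, not taken from the source:

```r
# Assumed centering helper; the c.() notation in the tables is taken to mean this.
c. <- function(x) scale(x, center = TRUE, scale = FALSE)[, 1]

# Fixed effects matching Table S6: centered phonotactic score and macron
# presence, each interacting with participant group.
f <- enteredResponse ~ c.(scoreDictShortVType) * group + macron * group

# Intended fit (sketch only; requires the ordinal package and the rating data):
# library(ordinal)
# clmm(update(f, . ~ . + (1 | workerId) + (1 | word)), data = dataExp2)
```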

3.9 Code for plotting Fig. 2

Fig. 2: Interaction between phonotactic scores and participant groups. The range of phonotactic score is represented on the x-axis and the range of predicted rating is represented on the y-axis.
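The chunk that draws Fig. 2 is not echoed in this extract; the sketch below reconstructs only the shape of the plotted data, with a hypothetical prediction grid and per-group slopes read off the fixed effects in Table S6 (the actual predictions come from the fitted ordinal model):

```r
# Hypothetical grid: phonotactic score range crossed with the three groups.
grid <- expand.grid(
  score = seq(-2, 2, length.out = 50),
  group = c("NMS", "MS", "US"),
  stringsAsFactors = FALSE
)

# Placeholder linear trends using the group-specific slopes implied by
# Table S6 (baseline 3.118; MS 3.118 - 0.089; US 3.118 - 2.288).
slope <- c(NMS = 3.118, MS = 3.029, US = 0.830)
grid$predicted <- slope[grid$group] * grid$score

# A ggplot2 call of roughly this shape would then draw the figure:
# library(ggplot2)
# ggplot(grid, aes(score, predicted, colour = group)) + geom_line()
```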


3.9.1 Phonotactics derived from the dictionary of Māori words in New Zealand English

The phonotactic score derived from Māori borrowings in New Zealand English is added as an additional predictor to the best ordinal mixed-effects model presented in Table S6.
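In formula terms, this extension adds the centered NZE-derived score together with its two- and three-way interactions with macron and group, matching the terms reported in Table S7; as in the earlier sketch, the `c.()` helper and the random-effects structure are assumptions:

```r
c. <- function(x) scale(x, center = TRUE, scale = FALSE)[, 1]  # assumed centering helper

# Table S6 fixed effects plus the NZE-derived score and its interactions:
fNze <- enteredResponse ~ c.(scoreDictShortVType) * group +
  c.(scoreDictNzeShortV) * macron * group

# Intended fit (sketch only; requires the ordinal package and the rating data):
# library(ordinal)
# clmm(update(fNze, . ~ . + (1 | workerId) + (1 | word)), data = dataExp2)
```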

Table S7: Ordinal mixed-effects model of well-formedness ratings including the phonotactics derived from the dictionary of Māori words in New Zealand English
Parameter                                    Estimate  Std. Error  z.value  Pr(>|z|)
Effects
  c.(scoreDictShortVType)                       1.716       0.192    8.945     0.000 ***
  c.(scoreDictNzeShortV)                        2.082       0.213    9.778     0.000 ***
  macronTRUE                                    0.120       0.055    2.198     0.028 *
  groupMS                                      -0.093       0.186   -0.498     0.618
  groupUS                                       0.623       0.143    4.371     0.000 ***
  c.(scoreDictNzeShortV):macronTRUE            -0.659       0.224   -2.947     0.003 **
  c.(scoreDictShortVType):groupMS               0.281       0.270    1.042     0.297
  c.(scoreDictShortVType):groupUS              -1.267       0.237   -5.356     0.000 ***
  c.(scoreDictNzeShortV):groupMS               -0.539       0.289   -1.864     0.062 .
  c.(scoreDictNzeShortV):groupUS               -1.619       0.259   -6.257     0.000 ***
  macronTRUE:groupMS                            0.054       0.073    0.737     0.461
  macronTRUE:groupUS                           -0.153       0.066   -2.313     0.021 *
  c.(scoreDictNzeShortV):macronTRUE:groupMS     0.102       0.246    0.414     0.679
  c.(scoreDictNzeShortV):macronTRUE:groupUS     0.800       0.250    3.204     0.001 **
Thresholds
  1|2                                          -1.722       0.098
  2|3                                          -0.164       0.098
  3|4                                           0.976       0.098
  4|5                                           2.528       0.098

3.10 Post hoc exploration

3.10.1 Monte Carlo analyses

# (1) Monte Carlo analysis with NMS participants for lexicon sizes N ranging from 1k to 18k, drawing 1000 random samples of N types from the lexicon
dir.create("./MC");dir.create("./MC/dictTypeSample");dir.create("./MC/dictTypePM");dir.create("./MC/dictTypePM/aux");dir.create("./MC/dictTypeScore");dir.create("./MC/dictTypeResult")
for(i in 1:18){
  directoryName <- paste0("./MC/dictTypeSample/size",i,"k");dir.create(directoryName)
}
for(i in 1:18){
  directoryName <- paste0("./MC/dictTypePM/size",i,"k");dir.create(directoryName)
}
for(i in 1:18){
  directoryName <- paste0("./MC/dictTypePM/aux/size",i,"k");dir.create(directoryName)
}
for(i in 1:18){
  directoryName <- paste0("./MC/dictTypeScore/size",i,"k");dir.create(directoryName)
}
dictShortVType <- read.delim("./dict-shortvowels_types.txt", sep="\t", header= FALSE)
for(i in 1:18){
wdSample <- paste0("./MC/dictTypeSample/size",i,"k");m <- i*1000;set.seed(1234);setwd(wdSample)
for(j in 1:1000){
randomWords <- sample(dictShortVType$V1, m, replace = FALSE);outputFile <- paste0("randomization_", j, ".txt")
write.table(randomWords, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
# Generate phonotactic scores for each sample using "score-nofreq.sh"
shellCommand <- paste0("./score-nofreq.sh ./MC/dictTypeSample/size",i,"k/randomization_",j,".txt ./stimuli-shortvowels.txt ./MC/dictTypePM/size",i,"k ./MC/dictTypePM/aux/size",i,"k ./MC/dictTypeScore/size",i,"k")
system(shellCommand, intern = TRUE)
}
setwd("./../../..");wdScore <- paste0("./MC/dictTypeScore/size",i,"k");setwd(wdScore)
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","length","macron","scoreDictShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
# Run ordinal regression models with 1000 samples
for(k in 1:1000){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
}
setwd("./../../..");wdResult <- paste0("./MC/dictTypeResult");setwd(wdResult);outputAIC <- paste0("dictTypeSize",i,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (2) Repeating the same Monte Carlo analysis with MS participants for lexicon sizes N ranging from 1k to 18k, drawing 1000 random samples of N types from the lexicon
dir.create("./MC/msDictTypeResult")
for(i in 1:18){
wdScore <- paste0("./MC/dictTypeScore/size",i,"k");setwd(wdScore)
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreDictShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
}
setwd("./../../..");wdResult <- paste0("./MC/msDictTypeResult");setwd(wdResult);outputAIC <- paste0("msDictTypeSize",i,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC, row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (3) Monte Carlo analysis with NMS participants for lexicon sizes N ranging from 1k to 18k, drawing 1000 random samples of N types from the lexicon, weighted by token frequency
dir.create("./MC/dictTokenSample");dir.create("./MC/dictTokenPM");dir.create("./MC/dictTokenPM/aux");dir.create("./MC/dictTokenScore");dir.create("./MC/dictTokenResult")
for(i in 1:18){
  directoryName <- paste0("./MC/dictTokenSample/size",i,"k");dir.create(directoryName)
}
for(i in 1:18){
  directoryName <- paste0("./MC/dictTokenPM/size",i,"k");dir.create(directoryName)
}
for(i in 1:18){
  directoryName <- paste0("./MC/dictTokenPM/aux/size",i,"k");dir.create(directoryName)
}
for(i in 1:18){
  directoryName <- paste0("./MC/dictTokenScore/size",i,"k");dir.create(directoryName)
}
dictShortVFreq <- read.delim("./dict-shortvowels_freq.txt", sep="\t", header= TRUE)
for(i in 1:18){
wdSample <- paste0("./MC/dictTokenSample/size",i,"k");m <- i*1000;set.seed(1234);setwd(wdSample)
for(j in 1:1000){randomWords <- sample(dictShortVFreq$word, m, replace = FALSE, prob = dictShortVFreq$tokens);randomWords <- gsub(""," ",randomWords);randomWords <- gsub("^ ","",randomWords);randomWords <- gsub(" $","",randomWords);outputFile <- paste0("randomization_", j, ".txt")
write.table(randomWords, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq.sh ./MC/dictTokenSample/size",i,"k/randomization_",j,".txt ./stimuli-shortvowels.txt ./MC/dictTokenPM/size",i,"k ./MC/dictTokenPM/aux/size",i,"k ./MC/dictTokenScore/size",i,"k")
system(shellCommand, intern = TRUE)
}
setwd("./../../..");wdScore <- paste0("./MC/dictTokenScore/size",i,"k");setwd(wdScore)
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","length","macron","scoreDictShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
}
setwd("./../../..");wdResult <- paste0("./MC/dictTokenResult");setwd(wdResult);outputAIC <- paste0("dictTokenSize",i,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (4) Repeating the same Monte Carlo analysis with MS participants for lexicon sizes N ranging from 1k to 18k, drawing 1000 random samples of N types from the lexicon, weighted by token frequency
dir.create("./MC/msDictTokenResult")
for(i in 1:18){
wdScore <- paste0("./MC/dictTokenScore/size",i,"k");setwd(wdScore)
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreDictShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
}
setwd("./../../..");wdResult <- paste0("./MC/msDictTokenResult");setwd(wdResult);outputAIC <- paste0("msDictTokenSize",i,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC, row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (5) Monte Carlo analysis with NMS participants for lexicon sizes N ranging from 1k to 18k, each lexicon consisting of the N highest-frequency types (ties among equal-frequency types broken at random across the 1000 samples)
dir.create("./MC/NBestSample");dir.create("./MC/NBestPM");dir.create("./MC/NBestPM/aux");dir.create("./MC/NBestScore");dir.create("./MC/NBestResult")
for(i in 1:18){
  directoryName <- paste0("./MC/NBestSample/size",i,"k");dir.create(directoryName)
}
for(i in 1:18){
  directoryName <- paste0("./MC/NBestPM/size",i,"k");dir.create(directoryName)
}
for(i in 1:18){
  directoryName <- paste0("./MC/NBestPM/aux/size",i,"k");dir.create(directoryName)
}
for(i in 1:18){
  directoryName <- paste0("./MC/NBestScore/size",i,"k");dir.create(directoryName)
}
dictShortVFreq <- read.delim("./dict-shortvowels_freq.txt", sep="\t", header= TRUE)
for(i in 1:18){
wdSample <- paste0("./MC/NBestSample/size",i,"k");m <- i*1000;setwd(wdSample)
for(j in 1:1000){
dictShortVFreq <- dictShortVFreq[order(runif(nrow(dictShortVFreq))),]
dictShortVFreq <- dictShortVFreq[order(dictShortVFreq$tokens, decreasing=TRUE),]  
randomWords <- dictShortVFreq$word[1:m]
randomWords <- gsub(""," ",randomWords);randomWords <- gsub("^ ","",randomWords);randomWords <- gsub(" $","",randomWords);outputFile <- paste0("randomization_", j, ".txt")
write.table(randomWords, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq.sh ./MC/NBestSample/size",i,"k/randomization_",j,".txt ./stimuli-shortvowels.txt ./MC/NBestPM/size",i,"k ./MC/NBestPM/aux/size",i,"k ./MC/NBestScore/size",i,"k")
system(shellCommand, intern = TRUE)
}
setwd("./../../..");wdScore <- paste0("./MC/NBestScore/size",i,"k");setwd(wdScore)
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","length","macron","scoreDictShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
}
setwd("./../../..");wdResult <- paste0("./MC/NBestResult");setwd(wdResult);outputAIC <- paste0("NBestSize",i,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (6) Repeating the same Monte Carlo analysis with MS participants for lexicon sizes N ranging from 1k to 18k, each lexicon consisting of the N highest-frequency types (ties among equal-frequency types broken at random across the 1000 samples)
dir.create("./MC/msNBestResult")
for(i in 1:18){
wdScore <- paste0("./MC/NBestScore/size",i,"k");setwd(wdScore)
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreDictShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
}
setwd("./../../..");wdResult <- paste0("./MC/msNBestResult");setwd(wdResult);outputAIC <- paste0("msNBestSize",i,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC, row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (7) Monte Carlo analysis with NMS participants for lexicon sizes N ranging from 3k to 18k (using cutoffs based on raw frequency), consisting of the fixed N-most-frequent types appearing at least five times in running speech
dir.create("./MC/NBestFixedSample");dir.create("./MC/NBestFixedPM");dir.create("./MC/NBestFixedPM/aux");dir.create("./MC/NBestFixedScore");dir.create("./MC/NBestFixedResult")
dictShortVFreqCutOffs <- read.delim("./dict-shortvowels_freq_cutoffs.txt", sep="\t", header= TRUE)
listCutOffs <- dictShortVFreqCutOffs[dictShortVFreqCutOffs$N > 3000,]$N
for(i in listCutOffs){
wdSample <- paste0("./MC/NBestFixedSample");setwd(wdSample)
dictShortVFreq <- dictShortVFreq[order(dictShortVFreq$tokens.raw, decreasing=TRUE),]
randomWords <- dictShortVFreq$word[1:i];randomWords <- gsub(""," ",randomWords);randomWords <- gsub("^ ","",randomWords);randomWords <- gsub(" $","",randomWords)
outputFile <- paste0("NBestFixedSample",i,".txt")
write.table(randomWords, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq.sh ./MC/NBestFixedSample/NBestFixedSample",i,".txt ./stimuli-shortvowels.txt ./MC/NBestFixedPM ./MC/NBestFixedPM/aux ./MC/NBestFixedScore")
system(shellCommand, intern = TRUE)
setwd("./../..")
}
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","length","macron","scoreDictShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
wdScore <- paste0("./MC/NBestFixedScore");setwd(wdScore)
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:6){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
}
setwd("./../..");wdResult <- paste0("./MC/NBestFixedResult");setwd(wdResult);outputAIC <- paste0("NBestFixedListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")

listAIC <- data.frame(AICscore = unlist(listAIC), size = listCutOffs)
listAIC$size[which.min(listAIC$AICscore)] # 3470
min(listAIC$AICscore) # 103767.4
listAIC <- listAIC[,c("size", "AICscore")];colnames(listAIC) <- c("Cutoff","AIC")
# write.table(listAIC,"dataTableS7.txt",sep ="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)

# (8) Repeating the same Monte Carlo analysis with MS participants for lexicon sizes N ranging from 3k to 18k (using cutoffs based on raw frequency), consisting of the fixed N-most-frequent types appearing at least five times in running speech
dir.create("./MC/msNBestFixedResult")
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreDictShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
wdScore <- paste0("./MC/NBestFixedScore");setwd(wdScore)
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:6){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
}
setwd("./../..");wdResult <- paste0("./MC/msNBestFixedResult");setwd(wdResult);outputAIC <- paste0("msNBestFixedListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")

listAIC <- data.frame(AICscore = unlist(listAIC), size = listCutOffs)
listAIC$size[which.min(listAIC$AICscore)] # 3879
min(listAIC$AICscore) # 36657.44
listAIC <- listAIC[,c("size", "AICscore")];colnames(listAIC) <- c("Cutoff","AIC")
# write.table(listAIC,"dataTableS8.txt",sep ="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)

# Collect NMS AIC results (unweighted type samples) across the 18 lexicon sizes
dictTypeAIC <- do.call(rbind, lapply(1:18, function(i){
  df <- read.delim(paste0("./MC/dictTypeResult/dictTypeSize",i,"kListAIC.txt"), sep ="\t", header = FALSE)
  df$size <- paste0(i,"k");df
}))
dictTypeAIC$type <- "unweighted"

dictTokenAIC <- do.call(rbind, lapply(1:18, function(i){
  df <- read.delim(paste0("./MC/dictTokenResult/dictTokenSize",i,"kListAIC.txt"), sep ="\t", header = FALSE)
  df$size <- paste0(i,"k");df
}))
dictTokenAIC$type <- "frequency-weighted"

NBestAIC <- do.call(rbind, lapply(1:18, function(i){
  df <- read.delim(paste0("./MC/NBestResult/NBestSize",i,"kListAIC.txt"), sep ="\t", header = FALSE)
  df$size <- paste0(i,"k");df
}))
NBestAIC$type <- "N highest-frequency"

nmsAIC <- rbind(dictTypeAIC, dictTokenAIC, NBestAIC)

msDictTypeAIC <- do.call(rbind, lapply(1:18, function(i){
  df <- read.delim(paste0("./MC/msDictTypeResult/msDictTypeSize",i,"kListAIC.txt"), sep ="\t", header = FALSE)
  df$size <- paste0(i,"k");df
}))
msDictTypeAIC$type <- "unweighted"

msDictTokenAIC <- do.call(rbind, lapply(1:18, function(i){
  df <- read.delim(paste0("./MC/msDictTokenResult/msDictTokenSize",i,"kListAIC.txt"), sep ="\t", header = FALSE)
  df$size <- paste0(i,"k");df
}))
msDictTokenAIC$type <- "frequency-weighted"

msNBestAIC <- do.call(rbind, lapply(1:18, function(i){
  df <- read.delim(paste0("./MC/msNBestResult/msNBestSize",i,"kListAIC.txt"), sep ="\t", header = FALSE)
  df$size <- paste0(i,"k");df
}))
msNBestAIC$type <- "N highest-frequency"

msAIC <- rbind(msDictTypeAIC, msDictTokenAIC, msNBestAIC)
# write.table(nmsAIC,"dataFig3NMS.txt",sep ="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
# write.table(msAIC,"dataFig3MS.txt",sep ="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)

# Run ordinal regression models with full lexicon
clmfitFullDictNMS <- clm(enteredResponse ~ scoreDictShortVType + macron, data=dataNMS)
clmfitFullDictMS <- clm(enteredResponse ~ scoreDictShortVType + macron, data=dataMS)
# saveRDS(clmfitFullDictNMS, file = "fullDictNMS.rds")
# saveRDS(clmfitFullDictMS, file = "fullDictMS.rds")

3.10.2 Comparing AIC scores using cutoffs based on raw frequency

Table S8: Comparing AIC scores with NMS participants, using cutoffs based on raw frequency
Cutoff        AIC
3164     103818.7
3470     103767.4
3879     103849.8
4526     103832.8
5821     103847.2
18703    103924.7

Table S9: Comparing AIC scores with MS participants, using cutoffs based on raw frequency
Cutoff        AIC
3164     36682.47
3470     36664.50
3879     36657.44
4526     36662.62
5821     36678.26
18703    36675.80

3.10.3 Code for plotting Fig. 3

Fig. 3: Monte Carlo simulations with 1,000 random samples over 18 dictionary sizes. The bars represent bootstrap 95% confidence intervals.
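The bootstrap 95% confidence intervals in Fig. 3 can be computed per lexicon size from its 1,000 AIC values; below is a base R sketch using a percentile bootstrap on simulated values (the real values are read from the *Result directories above, and the percentile method is an assumption about how the intervals were formed):

```r
set.seed(1)
# Simulated stand-in for one lexicon size's 1,000 AIC values
# (e.g. the contents of dictTypeSize5kListAIC.txt).
aic <- rnorm(1000, mean = 104000, sd = 50)

# Percentile bootstrap of the mean AIC: resample with replacement and take
# the 2.5% and 97.5% quantiles of the bootstrap distribution of the mean.
bootMeans <- replicate(2000, mean(sample(aic, replace = TRUE)))
ci <- quantile(bootMeans, c(0.025, 0.975))
ci
```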


3.11 Comparing AIC scores for model selection for sub-word units (morphs)

Table S10: Comparing AIC scores for model selection for sub-word units (morphs)
Score                       AIC
scoreDictShortVType    208110.9
scoreMorphShortVType   207787.4
scoreMorphShortVToken  209698.5

3.12 Phonotactic knowledge from morphs (Summary of the best model with the lowest AIC)

Table S11: Ordinal mixed-effects model summary for well-formedness ratings by participant group for sub-word units (morphs). All numeric variables in this model are centered.
Parameter                           Estimate  Std. Error  z.value  Pr(>|z|)
Effects
  c.(scoreMorphShortVType)             4.271       0.244   17.518     0.000 ***
  macronTRUE                           0.183       0.055    3.332     0.001 ***
  groupMS                             -0.046       0.199   -0.234     0.815
  groupUS                              0.662       0.152    4.345     0.000 ***
  c.(scoreMorphShortVType):groupMS    -0.480       0.427   -1.124     0.261
  c.(scoreMorphShortVType):groupUS    -3.315       0.337   -9.837     0.000 ***
  macronTRUE:groupMS                   0.049       0.073    0.676     0.499
  macronTRUE:groupUS                  -0.219       0.066   -3.334     0.001 ***
Thresholds
  1|2                                 -1.680       0.105
  2|3                                 -0.121       0.104
  3|4                                  1.019       0.104
  4|5                                  2.570       0.105

3.12.1 Monte Carlo analyses with the phonotactics derived from morphs

# (1) Monte Carlo analysis with NMS participants for morph set sizes N ranging from 0.5k to 3.5k, drawing 1000 random samples of N types from the morph set
dir.create("./MC/morphTypeSample");dir.create("./MC/morphTypePM");dir.create("./MC/morphTypePM/aux");dir.create("./MC/morphTypeScore");dir.create("./MC/morphTypeResult")
for(i in 1:7){
  directoryName <- paste0("./MC/morphTypeSample/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphTypePM/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphTypePM/aux/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphTypeScore/size",i*0.5,"k");dir.create(directoryName)
}
morphShortVFreq <- read.delim("./morphs-shortvowels_freq.txt", sep="\t", header= TRUE)
for(i in 1:7){
wdSample <- paste0("./MC/morphTypeSample/size",i*0.5,"k");m <- i*500;set.seed(1234);setwd(wdSample)
for(j in 1:1000){
randomMorphs <- sample(morphShortVFreq$morph, m, replace = FALSE)
randomMorphs <- gsub(""," ",randomMorphs);randomMorphs <- gsub("^ ","",randomMorphs);randomMorphs <- gsub(" $","",randomMorphs);outputFile <- paste0("randomization_", j, ".txt")
write.table(randomMorphs, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq_morphs.sh ./MC/morphTypeSample/size",i*0.5,"k/randomization_",j,".txt ./stimuli-shortvowels_morphs.txt ./MC/morphTypePM/size",i*0.5,"k ./MC/morphTypePM/aux/size",i*0.5,"k ./MC/morphTypeScore/size",i*0.5,"k")
system(shellCommand, intern = TRUE)
}
setwd("./../../..");wdScore <- paste0("./MC/morphTypeScore/size",i*0.5,"k");setwd(wdScore)
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","length","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
  }
setwd("./../../..");wdResult <- paste0("./MC/morphTypeResult");setwd(wdResult)
outputAIC <- paste0("morphTypeSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (2) Repeating the same Monte Carlo analysis with MS participants for morph set sizes N ranging from 0.5k to 3.5k, drawing 1000 random samples of N types from the morph set
dir.create("./MC/msMorphTypeResult")
for(i in 1:7){
wdScore <- paste0("./MC/morphTypeScore/size",i*0.5,"k");setwd(wdScore)
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of', k,'\n')
  }
setwd("./../../..");wdResult <- paste0("./MC/msMorphTypeResult");setwd(wdResult)
outputAIC <- paste0("msMorphTypeSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (3) Monte Carlo analysis with NMS participants for morph set sizes N ranging from 0.5k to 3.5k, drawing 1000 random samples of N types from the morph set, weighted by their token frequency in running speech 
dir.create("./MC/morphTokenSample");dir.create("./MC/morphTokenPM");dir.create("./MC/morphTokenPM/aux");dir.create("./MC/morphTokenScore");dir.create("./MC/morphTokenResult")
for(i in 1:7){
  directoryName <- paste0("./MC/morphTokenSample/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphTokenPM/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphTokenPM/aux/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphTokenScore/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
wdSample <- paste0("./MC/morphTokenSample/size",i*0.5,"k");m <- i*500;set.seed(1234);setwd(wdSample)
for(j in 1:1000){randomMorphs <- sample(morphShortVFreq$morph, m, replace = FALSE, prob=morphShortVFreq$tokens)
randomMorphs <- gsub(""," ",randomMorphs);randomMorphs <- gsub("^ ","",randomMorphs);randomMorphs <- gsub(" $","",randomMorphs);outputFile <- paste0("randomization_", j, ".txt")  
write.table(randomMorphs, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq_morphs.sh ./MC/morphTokenSample/size",i*0.5,"k/randomization_",j,".txt ./stimuli-shortvowels_morphs.txt ./MC/morphTokenPM/size",i*0.5,"k ./MC/morphTokenPM/aux/size",i*0.5,"k ./MC/morphTokenScore/size",i*0.5,"k")
system(shellCommand, intern = TRUE)
} 
setwd("./../../..");wdScore <- paste0("./MC/morphTokenScore/size",i*0.5,"k");setwd(wdScore)
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list); list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
  }  
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/morphTokenResult");setwd(wdResult)
outputAIC <- paste0("morphTokenSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}
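The sampling loops here and below all reuse the same `gsub()` chain to turn each sampled morph into space-separated symbols before writing it out; a minimal illustration of what the chain does:

```r
# Illustration of the gsub() chain used when writing each random sample:
# gsub("", " ", x) inserts a space at every character boundary, and the
# next two calls trim the leading and trailing space.
spaceOut <- function(x) {
  x <- gsub("", " ", x)   # "whaka" -> " w h a k a "
  x <- gsub("^ ", "", x)
  gsub(" $", "", x)
}
spaceOut("whaka")  # "w h a k a"
```

Note that this splits on single characters, so digraphs such as ⟨wh⟩ and ⟨ng⟩ are assumed to be encoded as single symbols in the morph set (an assumption about the encoding, not stated in this excerpt).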

# (4) Repeating the same Monte Carlo analysis with MS participants for morph set sizes N ranging from 0.5k to 3.5k, drawing 1000 random samples of N types from the morph set, weighted by their token frequency in running speech 
dir.create("./MC/msMorphTokenResult")
for(i in 1:7){
wdScore <- paste0("./MC/morphTokenScore/size",i*0.5,"k");setwd(wdScore)
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/msMorphTokenResult");setwd(wdResult)
outputAIC <- paste0("msMorphTokenSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (5) Monte Carlo analysis with NMS participants for morph set sizes N ranging from 0.5k to 3.5k, drawing 1000 random samples of N-best morph types from the morph set, weighted by their token frequency in running speech
dir.create("./MC/morphTokenNBestSample");dir.create("./MC/morphTokenNBestPM");dir.create("./MC/morphTokenNBestPM/aux");dir.create("./MC/morphTokenNBestScore");dir.create("./MC/morphTokenNBestResult")
for(i in 1:7){
  directoryName <- paste0("./MC/morphTokenNBestSample/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphTokenNBestPM/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphTokenNBestPM/aux/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphTokenNBestScore/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
wdSample <- paste0("./MC/morphTokenNBestSample/size",i*0.5,"k");m <- i*500;setwd(wdSample)
for(j in 1:1000){
morphShortVFreq <- morphShortVFreq[order(runif(nrow(morphShortVFreq))),]
morphShortVFreq <- morphShortVFreq[order(morphShortVFreq$tokens.raw, decreasing=TRUE),]
randomMorphs <- morphShortVFreq$morph[1:m]  
randomMorphs <- gsub(""," ",randomMorphs);randomMorphs <- gsub("^ ","",randomMorphs);randomMorphs <- gsub(" $","",randomMorphs);outputFile <- paste0("randomization_", j, ".txt")
write.table(randomMorphs, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq_morphs.sh ./MC/morphTokenNBestSample/size",i*0.5,"k/randomization_",j,".txt ./stimuli-shortvowels_morphs.txt ./MC/morphTokenNBestPM/size",i*0.5,"k ./MC/morphTokenNBestPM/aux/size",i*0.5,"k ./MC/morphTokenNBestScore/size",i*0.5,"k")
system(shellCommand, intern = TRUE)
}
setwd("./../../..");wdScore <- paste0("./MC/morphTokenNBestScore/size",i*0.5,"k");setwd(wdScore)
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
  }
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/morphTokenNBestResult");setwd(wdResult)
outputAIC <- paste0("morphTokenNBestSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}
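The N-best sampling above relies on `order()` being a stable sort: shuffling the rows with `order(runif(...))` first means that morphs tied in `tokens.raw` enter the top N in a different random order on each of the 1000 draws. A minimal illustration:

```r
# Illustration of the shuffle-then-sort idiom used above: order() is a
# stable sort, so pre-shuffling randomizes the order of tied frequencies.
set.seed(42)
freq <- data.frame(morph = c("ka", "te", "ra", "ma"),
                   tokens.raw = c(3, 5, 3, 3))
freq <- freq[order(runif(nrow(freq))), ]                    # random shuffle
freq <- freq[order(freq$tokens.raw, decreasing = TRUE), ]   # stable sort
freq$morph[1]  # always "te"; the three tied morphs follow in random order
```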

# (6) Repeating the same Monte Carlo analysis with MS participants for morph set sizes N ranging from 0.5k to 3.5k, drawing 1000 random samples of N-best morph types from the morph set, weighted by their token frequency in running speech
dir.create("./MC/msMorphTokenNBestResult")
for(i in 1:7){
wdScore <- paste0("./MC/morphTokenNBestScore/size",i*0.5,"k");setwd(wdScore)
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/msMorphTokenNBestResult");setwd(wdResult)
outputAIC <- paste0("msMorphTokenNBestSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (7) Monte Carlo analysis with NMS participants for fixed N-best morph sets, with N given by cutoffs based on the raw frequency of the words the morphs appear in, consisting of the morph types appearing at least 15 times in running speech
dir.create("./MC/morphNBestFixedSample");dir.create("./MC/morphNBestFixedPM");dir.create("./MC/morphNBestFixedPM/aux");dir.create("./MC/morphNBestFixedScore");dir.create("./MC/morphNBestFixedResult")
morphShortVFreqCutOffs <- read.delim("./morphs-shortvowels_freq_cutoffs.txt", sep="\t", header= TRUE)
listCutOffs <- morphShortVFreqCutOffs[morphShortVFreqCutOffs$N > 1000,]$N
for(i in listCutOffs){
wdSample <- paste0("./MC/morphNBestFixedSample");setwd(wdSample)
morphShortVFreq <- morphShortVFreq[order(morphShortVFreq$tokens.raw, decreasing=TRUE),]
randomMorphs <- morphShortVFreq$morph[1:i];randomMorphs <- gsub(""," ",randomMorphs);randomMorphs <- gsub("^ ","",randomMorphs);randomMorphs <- gsub(" $","",randomMorphs)
outputFile <- paste0("morphNBestFixedSample",i,".txt")
write.table(randomMorphs, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq_morphs.sh ./MC/morphNBestFixedSample/morphNBestFixedSample",i,".txt ./stimuli-shortvowels_morphs.txt ./MC/morphNBestFixedPM ./MC/morphNBestFixedPM/aux ./MC/morphNBestFixedScore")
system(shellCommand, intern = TRUE)
setwd("./../..")
}
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
list <- read.csv(list);list <- list[,c("item","logprob")]
dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
return(dataNew)
}
wdScore <- paste0("./MC/morphNBestFixedScore");setwd(wdScore)
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:16){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
}
setwd("./../..");wdResult <- paste0("./MC/morphNBestFixedResult");setwd(wdResult);outputAIC <- paste0("NBestFixedListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")

listAIC <- data.frame(AICscore = unlist(listAIC), size = listCutOffs)
listAIC$size[which.min(listAIC$AICscore)] # 1629
min(listAIC$AICscore) # 103363.2
listAIC <- listAIC[,c("size", "AICscore")];colnames(listAIC) <- c("Cutoff","AIC")
# write.table(listAIC,"dataTableS11.txt",sep ="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)

# (8) Repeating the same analysis with MS participants for fixed N-best morph sets, with N given by cutoffs based on the raw frequency of the words the morphs appear in, consisting of the morph types appearing at least 15 times in running speech
dir.create("./MC/msMorphNBestFixedResult")
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreDictShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
wdScore <- paste0("./MC/morphNBestFixedScore");setwd(wdScore)
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:16){
  files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
  clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
  listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
}
setwd("./../..");wdResult <- paste0("./MC/msMorphNBestFixedResult");setwd(wdResult);outputAIC <- paste0("msMorphNBestFixedListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")

listAIC <- data.frame(AICscore = unlist(listAIC), size = listCutOffs)
listAIC$size[which.min(listAIC$AICscore)] # 3636
min(listAIC$AICscore) # 36718.47
listAIC <- listAIC[,c("size", "AICscore")];colnames(listAIC) <- c("Cutoff","AIC")
# write.table(listAIC,"dataTableS12.txt",sep ="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)

# (9) Monte Carlo analysis with NMS participants for morph set sizes N ranging from 0.5k to 3.5k, considering stimuli as unparsed items, unweighted
dir.create("./MC/morphShortVUnsegSample");dir.create("./MC/morphShortVUnsegPM");dir.create("./MC/morphShortVUnsegPM/aux");dir.create("./MC/morphShortVUnsegScore");dir.create("./MC/morphShortVUnsegResult")
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegSample/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegPM/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegPM/aux/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegScore/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
wdSample <- paste0("./MC/morphShortVUnsegSample/size",i*0.5,"k");m <- i*500;setwd(wdSample)
for(j in 1:1000){
randomMorphs <- sample(morphShortVFreq$morph, m, replace = FALSE)
randomMorphs <- gsub(""," ",randomMorphs);randomMorphs <- gsub("^ ","",randomMorphs);randomMorphs <- gsub(" $","",randomMorphs);outputFile <- paste0("randomization_", j, ".txt")
write.table(randomMorphs, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq.sh ./MC/morphShortVUnsegSample/size",i*0.5,"k/randomization_",j,".txt ./stimuli-shortvowels.txt ./MC/morphShortVUnsegPM/size",i*0.5,"k  ./MC/morphShortVUnsegPM/aux/size",i*0.5,"k ./MC/morphShortVUnsegScore/size",i*0.5,"k")
system(shellCommand, intern = TRUE)
}
setwd("./../../..");wdScore <- paste0("./MC/morphShortVUnsegScore/size",i*0.5,"k");setwd(wdScore)
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/morphShortVUnsegResult");setwd(wdResult)
outputAIC <- paste0("morphShortVUnsegSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (10) Repeating the same Monte Carlo analysis with MS participants for morph set sizes N ranging from 0.5k to 3.5k, considering stimuli as unparsed items, unweighted
dir.create("./MC/msMorphShortVUnsegResult")
for(i in 1:7){
wdScore <- paste0("./MC/morphShortVUnsegScore/size",i*0.5,"k");setwd(wdScore)
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/msMorphShortVUnsegResult");setwd(wdResult)
outputAIC <- paste0("msMorphShortVUnsegSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (11) Monte Carlo analysis with NMS participants for morph set sizes N ranging from 0.5k to 3.5k, considering stimuli as unparsed items, weighted by their token frequency in running speech
dir.create("./MC/morphShortVUnsegTokenSample");dir.create("./MC/morphShortVUnsegTokenPM");dir.create("./MC/morphShortVUnsegTokenPM/aux");dir.create("./MC/morphShortVUnsegTokenScore");dir.create("./MC/morphShortVUnsegTokenResult")
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegTokenSample/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegTokenPM/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegTokenPM/aux/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegTokenScore/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
wdSample <- paste0("./MC/morphShortVUnsegTokenSample/size",i*0.5,"k");m <- i*500;setwd(wdSample)
for(j in 1:1000){
randomMorphs <- sample(morphShortVFreq$morph, m, replace = FALSE, prob=morphShortVFreq$tokens)
randomMorphs <- gsub(""," ",randomMorphs);randomMorphs <- gsub("^ ","",randomMorphs);randomMorphs <- gsub(" $","",randomMorphs);outputFile <- paste0("randomization_", j, ".txt")
write.table(randomMorphs, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq.sh ./MC/morphShortVUnsegTokenSample/size",i*0.5,"k/randomization_",j,".txt ./stimuli-shortvowels.txt ./MC/morphShortVUnsegTokenPM/size",i*0.5,"k ./MC/morphShortVUnsegTokenPM/aux/size",i*0.5,"k ./MC/morphShortVUnsegTokenScore/size",i*0.5,"k")
system(shellCommand, intern = TRUE)
}
setwd("./../../..");wdScore <- paste0("./MC/morphShortVUnsegTokenScore/size",i*0.5,"k");setwd(wdScore)
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/morphShortVUnsegTokenResult");setwd(wdResult)
outputAIC <- paste0("morphShortVUnsegTokenSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (12) Repeating the same Monte Carlo analysis with MS participants for morph set sizes N ranging from 0.5k to 3.5k, considering stimuli as unparsed items, weighted by their token frequency in running speech
dir.create("./MC/msMorphShortVUnsegTokenResult")
for(i in 1:7){
wdScore <- paste0("./MC/morphShortVUnsegTokenScore/size",i*0.5,"k");setwd(wdScore)
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/msMorphShortVUnsegTokenResult");setwd(wdResult)
outputAIC <- paste0("msMorphShortVUnsegTokenSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (13) Monte Carlo analysis with NMS participants for morph set sizes N ranging from 0.5k to 3.5k, considering stimuli as unparsed items, drawing 1000 random samples of N-best morph types from the morph set weighted by their token frequency in running speech
dir.create("./MC/morphShortVUnsegTokenNBestSample");dir.create("./MC/morphShortVUnsegTokenNBestPM");dir.create("./MC/morphShortVUnsegTokenNBestPM/aux");dir.create("./MC/morphShortVUnsegTokenNBestScore");dir.create("./MC/morphShortVUnsegTokenNBestResult")
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegTokenNBestSample/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegTokenNBestPM/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegTokenNBestPM/aux/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
  directoryName <- paste0("./MC/morphShortVUnsegTokenNBestScore/size",i*0.5,"k");dir.create(directoryName)
}
for(i in 1:7){
wdSample <- paste0("./MC/morphShortVUnsegTokenNBestSample/size",i*0.5,"k");m <- i*500;setwd(wdSample)
for(j in 1:1000){
morphShortVFreq <- morphShortVFreq[order(runif(nrow(morphShortVFreq))),]
morphShortVFreq <- morphShortVFreq[order(morphShortVFreq$tokens.raw, decreasing=TRUE),]
randomMorphs <- morphShortVFreq$morph[1:m]
randomMorphs <- gsub(""," ",randomMorphs);randomMorphs <- gsub("^ ","",randomMorphs);randomMorphs <- gsub(" $","",randomMorphs);outputFile <- paste0("randomization_", j, ".txt")
write.table(randomMorphs, outputFile, row.names = FALSE, col.names=FALSE, quote=FALSE)
shellCommand <- paste0("./score-nofreq.sh ./MC/morphShortVUnsegTokenNBestSample/size",i*0.5,"k/randomization_",j,".txt ./stimuli-shortvowels.txt ./MC/morphShortVUnsegTokenNBestPM/size",i*0.5,"k ./MC/morphShortVUnsegTokenNBestPM/aux/size",i*0.5,"k ./MC/morphShortVUnsegTokenNBestScore/size",i*0.5,"k")
system(shellCommand, intern = TRUE)
}
setwd("./../../..");wdScore <- paste0("./MC/morphShortVUnsegTokenNBestScore/size",i*0.5,"k");setwd(wdScore)
dataNMS <- dataExp2[dataExp2$group=="NMS",]
dataNMS <- dataNMS[c("enteredResponse","word","word1","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataNMS$item <- dataNMS$word1;dataNew <- merge(dataNMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/morphShortVUnsegTokenNBestResult");setwd(wdResult)
outputAIC <- paste0("morphShortVUnsegTokenNBestSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

# (14) Repeating the same Monte Carlo analysis with MS participants for morph set sizes N ranging from 0.5k to 3.5k, considering stimuli as unparsed items, drawing 1000 random samples of N-best morph types from the morph set weighted by their token frequency in running speech
dir.create("./MC/msMorphShortVUnsegTokenNBestResult")
for(i in 1:7){
wdScore <- paste0("./MC/morphShortVUnsegTokenNBestScore/size",i*0.5,"k");setwd(wdScore)
dataMS <- dataExp2[dataExp2$group=="MS",]
dataMS <- dataMS[c("enteredResponse","word","word1","length","macron","scoreMorphShortVType","workerId")]
randomScore <- function(list){
  list <- read.csv(list);list <- list[,c("item","logprob")]
  dataMS$item <- dataMS$word1;dataNew <- merge(dataMS, list, by="item")
  dataNew$length <- nchar(dataNew$item);dataNew$scoreNor <- dataNew$logprob/(dataNew$length + 1)
  return(dataNew)
}
filenames <- list.files(pattern="*.csv");files <- lapply(filenames, randomScore);listAIC <- list()
for(k in 1:1000){
    files[[k]]$enteredResponse <- as.factor(files[[k]]$enteredResponse)
    clmfit <- clm(enteredResponse ~ scoreNor + macron, data=files[[k]])
    listAIC[[k]] <- AIC(clmfit);message('Regression model of ', k)
  }
setwd("./../../..");wdResult <- paste0("./MC/msMorphShortVUnsegTokenNBestResult");setwd(wdResult)
outputAIC <- paste0("msMorphShortVUnsegTokenNBestSize",i*0.5,"kListAIC.txt")
write.table(unlist(listAIC), outputAIC,row.names = FALSE, col.names=FALSE, quote=FALSE)
setwd("./../..")
}

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/morphTypeResult/morphTypeSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
morphTypeAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
morphTypeAIC$type <- "unweighted parsed"
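The read-label-rbind pattern above (repeated below for each result directory) can be written without `assign()` creating seven named objects; a minimal sketch under the same file layout, with `readSizes` as a hypothetical helper name:

```r
# Sketch: read the seven per-size AIC files into one data frame without
# assign(). Assumes the ./MC/<dir>/<prefix>Size<s>ListAIC.txt layout above.
readSizes <- function(dir, prefix) {
  sizes <- paste0(1:7 * 0.5, "k")   # "0.5k" "1k" ... "3.5k"
  do.call(rbind, lapply(sizes, function(s) {
    d <- read.delim(paste0(dir, "/", prefix, "Size", s, "ListAIC.txt"),
                    sep = "\t", header = FALSE)
    d$size <- s
    d
  }))
}
# morphTypeAIC <- readSizes("./MC/morphTypeResult", "morphType")
```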

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/morphTokenResult/morphTokenSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
morphTokenAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
morphTokenAIC$type <- "frequency-weighted parsed"

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/morphTokenNBestResult/morphTokenNBestSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
morphTokenNBestAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
morphTokenNBestAIC$type <- "N highest-frequency parsed"

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/morphShortVUnsegResult/morphShortVUnsegSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
morphShortVUnsegAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
morphShortVUnsegAIC$type <- "unweighted unparsed"

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/morphShortVUnsegTokenResult/morphShortVUnsegTokenSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
morphShortVUnsegTokenAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
morphShortVUnsegTokenAIC$type <- "frequency-weighted unparsed"

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/morphShortVUnsegTokenNBestResult/morphShortVUnsegTokenNBestSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
morphShortVUnsegTokenNBestAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
morphShortVUnsegTokenNBestAIC$type <- "N highest-frequency unparsed"

nmsMorphAIC <- rbind(morphTypeAIC, morphTokenAIC, morphTokenNBestAIC, morphShortVUnsegAIC, morphShortVUnsegTokenAIC, morphShortVUnsegTokenNBestAIC)

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/msMorphTypeResult/msMorphTypeSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
msMorphTypeAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
msMorphTypeAIC$type <- "unweighted parsed"

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/msMorphTokenResult/msMorphTokenSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
msMorphTokenAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
msMorphTokenAIC$type <- "frequency-weighted parsed"

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/msMorphTokenNBestResult/msMorphTokenNBestSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
msMorphTokenNBestAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
msMorphTokenNBestAIC$type <- "N highest-frequency parsed"

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/msMorphShortVUnsegResult/msMorphShortVUnsegSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
msMorphShortVUnsegAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
msMorphShortVUnsegAIC$type <- "unweighted unparsed"

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/msMorphShortVUnsegTokenResult/msMorphShortVUnsegTokenSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
msMorphShortVUnsegTokenAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
msMorphShortVUnsegTokenAIC$type <- "frequency-weighted unparsed"

for(i in 1:7){
  assign(paste0("size",i*0.5,"k"),read.delim(paste0("./MC/msMorphShortVUnsegTokenNBestResult/msMorphShortVUnsegTokenNBestSize",i*0.5,"kListAIC.txt"), sep ="\t", header = FALSE))
}
size0.5k$size <- "0.5k";size1k$size <- "1k";size1.5k$size <- "1.5k";size2k$size <- "2k";size2.5k$size <- "2.5k";size3k$size <- "3k";size3.5k$size <- "3.5k"
msMorphShortVUnsegTokenNBestAIC <- rbind(size0.5k, size1k, size1.5k, size2k, size2.5k, size3k, size3.5k)
msMorphShortVUnsegTokenNBestAIC$type <- "N highest-frequency unparsed"

msMorphAIC <- rbind(msMorphTypeAIC, msMorphTokenAIC, msMorphTokenNBestAIC, msMorphShortVUnsegAIC,msMorphShortVUnsegTokenAIC, msMorphShortVUnsegTokenNBestAIC)

# write.table(nmsMorphAIC,"dataFigS8NMS.txt",sep ="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
# write.table(msMorphAIC,"dataFigS8MS.txt",sep ="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)

# Run ordinal regression models with all morphs
dataNMS$enteredResponse <- as.factor(dataNMS$enteredResponse) # clm() needs a factor response, as in the loops above
dataMS$enteredResponse <- as.factor(dataMS$enteredResponse)
clmfitMorphNMS <- clm(enteredResponse ~ scoreMorphShortVType + macron, data=dataNMS)
clmfitMorphMS <- clm(enteredResponse ~ scoreMorphShortVType + macron, data=dataMS)
# saveRDS(clmfitMorphNMS, file = "morphNMS.rds")
# saveRDS(clmfitMorphMS, file = "morphMS.rds")

3.12.2 Comparing AIC scores using cutoffs based on raw frequency of the words morphs appear in

Table S12: Comparing AIC scores with NMS participants, using cutoffs based on raw frequency of the words morphs appear in
Cutoff AIC
1002 103748.2
1020 103744.6
1042 103753.8
1068 103742.1
1090 103673.7
1122 103672.8
1151 103648.1
1190 103610.4
1229 103492.6
1276 103462.6
1339 103363.5
1395 103401.7
1475 103382.9
1629 103363.2
1813 103414.3
3636 103447.6
Table S13: Comparing AIC scores with MS participants, using cutoffs based on raw frequency of the words morphs appear in
Cutoff AIC
1002 36749.56
1020 36753.27
1042 36758.29
1068 36753.46
1090 36748.31
1122 36756.52
1151 36756.47
1190 36753.42
1229 36737.55
1276 36737.51
1339 36730.90
1395 36743.57
1475 36742.75
1629 36737.91
1813 36734.21
3636 36718.47

3.12.3 Code for plotting Fig. 4

Figure 4: Density plot of AIC scores for morph random samples

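The plotting code itself is not reproduced in this excerpt. A minimal base-R sketch of a comparable density plot, using toy data in the same shape as nmsMorphAIC (columns V1 = AIC, size, type); the actual figure may have been produced differently:

```r
# Toy data in the shape of nmsMorphAIC (V1 = one AIC score per random sample).
set.seed(1)
toy <- data.frame(V1 = rnorm(700, mean = 103500, sd = 100),
                  size = rep(paste0(1:7 * 0.5, "k"), each = 100),
                  type = "unweighted parsed")
# Overall density, with one curve per morph set size overlaid.
plot(density(toy$V1), main = "AIC scores for morph random samples", xlab = "AIC")
for (s in unique(toy$size)) lines(density(toy$V1[toy$size == s]), col = "grey")
```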

3.13 Comparing AIC scores with NMS participants for model selection

  • scoreDictShortVType is the phonotactic score computed over word types in the dictionary, ignoring vowel length distinctions.
  • scoreNBestFixed is the phonotactic score from the 3470 most frequent word types, the cutoff with the lowest AIC score among the cutoffs based on raw frequency.
  • scoreMorphShortVType is the phonotactic score computed over morph types derived from dictionary words, ignoring vowel length distinctions.
  • scoreMorphNBestFixed is the phonotactic score from the 1629 morph types with the highest token frequency in running speech, the cutoff with the lowest AIC score among the cutoffs based on raw frequency.
  • scoreMorphNBestWord is the phonotactic score derived from the 1409 morphs that occur within those 3470 words in the corpora.
  • scoreMorphShortVWord is the phonotactic score from the 1303 morphs weighted by the 3470 most frequent dictionary word types they appear in.
Table S14: Comparing AIC scores with NMS participants for model selection.
Score AIC
scoreDictShortVType 94266.43
scoreNBestFixed 94281.03
scoreMorphShortVType 93978.37
scoreMorphNBestFixed 93928.72
scoreMorphNBestWord 93980.88
scoreMorphShortVWord 94037.25

3.13.1 Self-reported exposure to Māori

NMS’ self-reported exposure to Māori is added as an additional predictor to the best ordinal mixed-effects model presented in Table S14.

Table S15: Ordinal mixed-effects model of well-formedness ratings including NMS’ self-reported exposure to Māori
Parameter Estimate Std. Error z.value Pr(>|z|)
Effects c.(scoreMorphNBestFixedNMS) 5.192 0.292 17.785 0.000 ***
macronTRUE 0.183 0.053 3.427 0.001 ***
c.(maoriExpo) 0.077 0.041 1.880 0.060 .
c.(scoreMorphNBestFixedNMS):c.(maoriExpo) 0.194 0.121 1.603 0.109
macronTRUE:c.(maoriExpo) 0.021 0.016 1.360 0.174
Thresholds 1|2 -1.712 0.091
2|3 -0.139 0.090
3|4 1.036 0.091
4|5 2.439 0.092
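The model behind Table S15 centers the predictors (the c.() notation above) and crosses them with self-reported exposure. A minimal sketch of that structure, again with MASS::polr standing in for the ordinal mixed-effects model and fully simulated data and variable names:

```r
# Hedged sketch of the Table S15 model: mean-centered predictors and their
# interactions with self-reported Maori exposure. Simulated stand-ins only.
library(MASS)

set.seed(3)
n <- 400
sim <- data.frame(
  score     = rnorm(n),                              # stand-in phonotactic score
  macron    = sample(c(TRUE, FALSE), n, replace = TRUE),
  maoriExpo = sample(1:5, n, replace = TRUE)         # stand-in exposure scale
)
sim$enteredResponse <- factor(cut(sim$score + rnorm(n), 5, labels = 1:5),
                              ordered = TRUE)

c. <- function(x) scale(x, scale = FALSE)            # mean-center, as in c.()
fit <- polr(enteredResponse ~ c.(score) * c.(maoriExpo) +
                              macron * c.(maoriExpo),
            data = sim, Hess = TRUE)
round(summary(fit)$coefficients, 3)
```

The real model additionally includes random effects (an ordinal mixed-effects model), which polr does not fit; in the ordinal package that would be a clmm() call of the same formula shape.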

4 Post-questionnaire

  1. How well are you able to speak Māori?
    ☐ Very well (I can talk about almost anything in Māori)
    ☐ Well (I can talk about many things in Māori)
    ☐ Fairly well (I can talk about some things in Māori)
    ☐ Not very well (I can only talk about simple/basic things in Māori)
    ☐ No more than a few words or phrases
    ☐ Not at all

  2. How well are you able to understand/read Māori?
    ☐ Very well (I can understand almost anything said/written in Māori)
    ☐ Well (I can understand many things said/written in Māori)
    ☐ Fairly well (I can understand some things said/written in Māori)
    ☐ Not very well (I can only understand simple/basic things said/written in Māori)
    ☐ No more than a few words or phrases
    ☐ Not at all

  3. Which age group do you belong to?
    ☐ 18 - 29
    ☐ 30 - 39
    ☐ 40 - 49
    ☐ 50 - 59
    ☐ +60

  4. Please state your gender:

  5. Please state your ethnicity:

  6. Your highest education is:
    ☐ High school
    ☐ Undergraduate degree
    ☐ Graduate degree

  7. How often do you think you are exposed to the Māori language in your daily life, by means of Māori radio, Māori TV, online media? (only relevant for fluent Māori speakers and non-fluent Māori speakers of New Zealand English)
    ☐ Less than once a year
    ☐ Less than once a month
    ☐ Less than once a week
    ☐ Less than once a day
    ☐ Multiple times a day

  8. How often do you think you are exposed to Māori language in your daily life, in conversation at work, at home, in social settings? (only relevant for fluent Māori speakers and non-fluent Māori speakers of New Zealand English)
    ☐ Less than once a year
    ☐ Less than once a month
    ☐ Less than once a week
    ☐ Less than once a day
    ☐ Multiple times a day

  9. In the past five years, have you had any children living with you who have attended preschool or primary school in New Zealand? (only relevant for fluent Māori speakers and non-fluent Māori speakers of New Zealand English)
    ☐ Yes
    ☐ No

  10. Please tick all boxes that apply.
    ☐ I can give a mihi in Māori.
    ☐ I can sing a few songs in Māori.
    ☐ I can sing the NZ national anthem in Māori.
    ☐ I know how to say some basic phrases (e.g. My name is…, I’m from…) in Māori.
    ☐ I know how to say some commands (e.g. Sit down / Come here) in Māori.
    ☐ I know how to say some greetings in Māori.
    ☐ I know how to say some numbers in Māori.
    ☐ I know how to say some body parts in Māori.
    ☐ I know how to say some colors in Māori.

  11. What region of New Zealand do you live in currently? (Please choose "overseas" if you are living outside of New Zealand.) (only relevant for fluent Māori speakers and non-fluent Māori speakers of New Zealand English)
    ☐ Northland
    ☐ Auckland
    ☐ Waikato
    ☐ Bay of Plenty
    ☐ Gisborne
    ☐ Hawke’s Bay
    ☐ Taranaki
    ☐ Wanganui
    ☐ Manawatu
    ☐ Wairarapa
    ☐ Wellington
    ☐ Nelson Bays
    ☐ Marlborough
    ☐ West Coast
    ☐ Canterbury
    ☐ Timaru - Oamaru
    ☐ Otago
    ☐ Southland
    ☐ Overseas

  12. How long have you been living there? (only relevant for fluent Māori speakers and non-fluent Māori speakers of New Zealand English)

  13. Please state your first language (the language you speak/use most of your time).

  14. What country were you living in when you first learned this language?

  15. Please list any other languages that you can speak fluently:

  16. Have you ever lived in Hawaii?
    ☐ Yes
    ☐ No

  17. Have you ever lived in New Zealand? (only relevant for non-fluent Māori speakers of American English)
    ☐ Yes
    ☐ No

  18. Do you speak/understand any Polynesian languages such as Hawaiian, Tahitian, Sāmoan, or Tongan?
    ☐ Yes
    ☐ No

  19. If you replied yes to question 18, please state the language you know.

  20. Do you have a history of any speech or language impairments that you are aware of?
    ☐ Yes
    ☐ No